AI Software Stack

Eagle-N is optimized for processing complex AI workloads such as video and text analysis, supported by a distributed software stack spanning both the host machine and the Eagle-N accelerator.

The BOS AI software stack is deployed across a Linux-based host system and the Eagle-N SoC, connected via a high-bandwidth PCIe Gen5 interface. The host application processor (AP), either x86- or ARM-based, manages system orchestration, data handling, and model execution flow, while Eagle-N performs compute-intensive neural network inference.

Execution flow of AI workloads:

  • The host machine sends input data (camera, sensor, or text streams) to Eagle-N via PCIe
  • Eagle-N runs neural network inference on its multi-cluster NPU
  • Results are returned to the host machine for further decision-making or service integration
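The three-step flow above can be sketched as a minimal host-side loop. All names here (`EagleNDevice`, `run_model`, `host_pipeline`) are hypothetical stand-ins for illustration, not a real driver API; on real hardware the transport is PCIe Gen5 and inference runs on the NPU.

```python
# Conceptual sketch of the host <-> Eagle-N execution flow.
# EagleNDevice and run_model are hypothetical mocks, not a real driver API.

class EagleNDevice:
    """Mock accelerator: stands in for the Eagle-N NPU behind PCIe."""

    def run_model(self, input_frame):
        # On real hardware, the multi-cluster NPU executes the compiled
        # neural network here; we return a dummy classification result.
        return {"label": "vehicle", "score": 0.97, "frame": input_frame}

def host_pipeline(frames, device):
    """Host side: stream inputs to the device, collect results back."""
    results = []
    for frame in frames:                 # 1. host sends input over PCIe
        out = device.run_model(frame)    # 2. Eagle-N runs inference
        results.append(out)              # 3. results return to the host
    return results

results = host_pipeline([0, 1, 2], EagleNDevice())
```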

Host machine and Eagle-N device software stack configuration:

  • Support for both x86 and ARM host machine platforms
  • Requires a bare-metal Linux or Android environment on the host machine
  • Two different software stacks: one for the host machine and the other for the Eagle-N
  • Supports functional safety aligned with ISO 26262 for reliable operation

This tightly integrated hardware-software architecture enables efficient workload distribution, high-throughput data processing, and robust system reliability for automotive and edge AI applications.

BOS Software stack

Tenstorrent-BOS AI Software

Tenstorrent's Layered Software Architecture

TT-Forge™: MLIR-Based Compiler

TT-Forge™ is Tenstorrent’s Multi-Level Intermediate Representation (MLIR)-based compiler. It bridges high-level machine learning frameworks with the Tenstorrent software stack.

Use TT-Forge™ to compile models from frameworks such as PyTorch, JAX, and TensorFlow for execution on Tenstorrent hardware. It offers an automated, general path to run many types of model architectures without requiring custom kernel development. TT-Forge™ integrates with and lowers to TT-Metalium for hardware execution.
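The multi-level lowering that an MLIR-based compiler performs can be modeled as a chain of dialect-to-dialect passes. The dialect names below loosely follow the tt-mlir project (TTIR, TTNN, TTMetal), but the pass logic is purely illustrative, not the real compiler implementation.

```python
# Conceptual sketch of MLIR-style multi-level lowering, as in TT-Forge.
# Each "pass" rewrites the module into a lower-level dialect.

def lower(module, passes):
    """Apply each lowering pass in order, recording the dialect trail."""
    trail = [module["dialect"]]
    for p in passes:
        module = p(module)
        trail.append(module["dialect"])
    return module, trail

# Illustrative passes: framework graph -> TTIR -> TTNN -> TTMetal.
def to_ttir(m):    return {"dialect": "ttir",    "ops": m["ops"]}
def to_ttnn(m):    return {"dialect": "ttnn",    "ops": m["ops"]}
def to_ttmetal(m): return {"dialect": "ttmetal", "ops": m["ops"]}

framework_module = {"dialect": "torch", "ops": ["matmul", "relu"]}
lowered, trail = lower(framework_module, [to_ttir, to_ttnn, to_ttmetal])
```

The point of the multi-level design is that each dialect preserves just enough structure for the next stage's optimizations, ending in TT-Metalium-level code for hardware execution.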

TT-NN™: A Python & C++ Neural Network OP library

TT-NN™ is a library of neural network operations that provides a user-friendly interface for running models on Tenstorrent hardware. It is designed to be intuitive for developers familiar with PyTorch.

Use TT-NN™ to run AI models using a familiar, high-level Python API without managing the complexities of the underlying hardware. TT-NN™ builds upon TT-Metalium™ and provides a stable set of pre-packaged, optimized operations. It is also available with a C++ API.
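The PyTorch-like call pattern can be illustrated with a small mock. The function names (`from_torch`, `matmul`, `to_torch`) mirror the public ttnn Python API, but the bodies below are plain-Python stand-ins so the sketch runs without Tenstorrent hardware.

```python
# Mock of the TT-NN call pattern: move a tensor to the device, run an op,
# move the result back. Implementations are stand-ins, not the real ttnn.

def from_torch(rows):                 # host tensor -> device tensor (mock)
    return {"device": "eagle-n", "data": rows}

def matmul(a, b):                     # device-side op (mock matmul)
    A, B = a["data"], b["data"]
    out = [[sum(A[i][k] * B[k][j] for k in range(len(B)))
            for j in range(len(B[0]))] for i in range(len(A))]
    return {"device": "eagle-n", "data": out}

def to_torch(t):                      # device tensor -> host tensor (mock)
    return t["data"]

a = from_torch([[1, 2], [3, 4]])
b = from_torch([[5, 6], [7, 8]])
result = to_torch(matmul(a, b))       # [[19, 22], [43, 50]]
```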

TT-Metalium™: Programming Tenstorrent Hardware

TT-Metalium™ is the low-level, open-source software development kit (SDK) that gives developers direct access to Tenstorrent hardware. It is a bare-metal programming environment designed for users who must write custom C++ kernels for machine learning or other high-performance computing workloads. It occupies the same category as NVIDIA's CUDA or AMD's HIP, but while TT-Metalium™ offers low-level access to hardware features, that access comes at the price of learning a new programming model.

Use TT-Metalium™ when you require complete control over the hardware to optimize code for performance, explicitly manage memory, or implement novel operations not found in standard libraries. This environment exposes the RISC-V processors, the Network-on-Chip (NoC), and the matrix and vector engines within each Tensix core.

TT-LLK: Low-level kernels

TT-LLK provides low-level kernels that serve as foundational compute primitives, acting as building blocks for higher-level software stacks that implement machine learning (ML) operations.

Advantages of BOS Neural Network Software

Eagle-N’s NPU system focuses fully on parallel programming. It achieves high-performance inference and utilization on current AI models while remaining flexible and programmable. It supports efficient tile-based compute and data movement, providing interleaved and shared buffer management for compute operations such as element-wise, matmul, reduction, and window-based operations.
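Tile-based compute can be illustrated with a toy example: a tensor is split into fixed-size tiles, each tile is processed as a unit, and the results are reassembled. The sketch below uses 2x2 tiles for brevity (real Tensix tiles are 32x32) and is purely conceptual, not a hardware API.

```python
# Illustrative tile-based element-wise compute: split, process per tile,
# reassemble. Conceptual only; real tiles on Tensix hardware are 32x32.

TILE = 2

def to_tiles(mat):
    """Split a square matrix (dims divisible by TILE) into TILE x TILE tiles."""
    n = len(mat)
    return {(r, c): [row[c:c + TILE] for row in mat[r:r + TILE]]
            for r in range(0, n, TILE) for c in range(0, n, TILE)}

def eltwise(tile, f):
    """Element-wise op applied to one tile at a time."""
    return [[f(x) for x in row] for row in tile]

def from_tiles(tiles, n):
    """Reassemble tiles back into an n x n matrix."""
    out = [[0] * n for _ in range(n)]
    for (r, c), tile in tiles.items():
        for i, row in enumerate(tile):
            out[r + i][c:c + TILE] = row
    return out

mat = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
tiles = {k: eltwise(t, lambda x: x * 2) for k, t in to_tiles(mat).items()}
doubled = from_tiles(tiles, 4)
```

Matmul, reduction, and window-based operations follow the same pattern: the unit of both compute and data movement is a tile, which is what makes buffer management (interleaved or shared) tractable.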

Four major advantages of Eagle-N’s NPU system are:

Near Memory Compute and Efficient Use of SRAM

Eagle-N’s NPU is composed of 24 AI compute engines, called Tensix cores, each with 5 RISC-V CPUs, forming a mesh topology.

  • Each Tensix core operates on its local SRAM.
  • Tensix cores are connected via two NoCs (Networks-on-Chip).
  • Each core can communicate with:
    • Any other Tensix core in the mesh
    • Off-chip DRAM
  • The entire SRAM is used as intermediate storage between operations.
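The mesh topology above can be modeled minimally: each core owns local SRAM, and any core can push a tile to any other core over the NoC, or spill it to off-chip DRAM. This is a hypothetical sketch of the addressing model, not a hardware API.

```python
# Minimal model of the Tensix mesh: per-core local SRAM, core-to-core
# transfers over the NoC, and off-chip DRAM. Conceptual sketch only.

class TensixCore:
    def __init__(self, coord):
        self.coord = coord
        self.sram = {}            # local SRAM: tile name -> tile data

class Mesh:
    def __init__(self, rows, cols):
        self.cores = {(r, c): TensixCore((r, c))
                      for r in range(rows) for c in range(cols)}
        self.dram = {}            # off-chip DRAM

    def send(self, src, dst, name):
        """Move a tile from src core's SRAM to dst ('dram' or a core coord)."""
        tile = self.cores[src].sram.pop(name)
        if dst == "dram":
            self.dram[name] = tile
        else:
            self.cores[dst].sram[name] = tile

mesh = Mesh(4, 6)                       # 24 Tensix cores, as on Eagle-N
mesh.cores[(0, 0)].sram["act0"] = [1, 2, 3]
mesh.send((0, 0), (3, 5), "act0")       # core-to-core over the NoC
mesh.send((3, 5), "dram", "act0")       # spill to off-chip DRAM
```

The 4x6 grid is an assumption for illustration; the source states only that there are 24 cores in a mesh.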

Explicit and Decoupled Data Movement

The performance and efficiency of data movement in AI are as important as the raw compute capacity of the math engines.

  • In Tensix, data movement is explicit and decoupled from compute.
  • Separate data movement engines in each Tensix core:
    • Transfer data from neighboring cores or off-chip DRAM
    • Store data into local SRAM
  • Data movement RISC-V processors:
    • Issue asynchronous, tile-sized data movement instructions
    • Enable a large number of outstanding transfers
    • Operate concurrently with the compute engine
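The decoupling described above can be sketched with a producer/consumer pair: a "reader" thread issues tile transfers into a bounded queue while a "compute" thread consumes them, so movement and math overlap. This is a conceptual analogue of the data movement RISC-V cores, not a hardware API.

```python
# Sketch of decoupled data movement: the reader streams tiles into a
# bounded queue (modeling outstanding transfers into SRAM) while the
# compute thread consumes them concurrently. Conceptual only.

import queue
import threading

tiles_in = queue.Queue(maxsize=4)     # bound models outstanding transfers
results = []

def reader(n):
    for i in range(n):
        tiles_in.put(i)               # async tile-sized transfer into SRAM
    tiles_in.put(None)                # end-of-stream marker

def compute():
    while True:
        tile = tiles_in.get()
        if tile is None:
            break
        results.append(tile * tile)   # math engine works concurrently

t_r = threading.Thread(target=reader, args=(8,))
t_c = threading.Thread(target=compute)
t_r.start(); t_c.start()
t_r.join(); t_c.join()
```

Because the queue is bounded, the reader can run ahead of compute by only a fixed number of tiles, which is the same back-pressure idea that keeps SRAM usage bounded on the device.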

Flexible Performance Optimization Support

  • tt-nn (high-level): Python-based, PyTorch-like interface with built-in optimizations and minimal customization.
  • tt-nn + bare-metal: Combines high-level APIs with custom kernel development for deeper optimization.
  • Bare-metal: Full low-level programming in C++ for maximum performance, targeting HPC and highly optimized kernels.

Flexible Configurations of Tensix Clusters

  • Enables the configuration of multiple model execution environments
  • Hardware-supported physical partitioning provides freedom from interference (FFI), making the system automotive-safety ready